
RFC Use dataframe api to also support Polars and other dataframe libraries #597

Closed
wants to merge 24 commits into from

Conversation

MarcoGorelli
Contributor

@MarcoGorelli MarcoGorelli commented Nov 12, 2023

Description

Discussed with @FBruzzesi a bit on Discord, so I thought I'd open a PR to show what this would look like. I'm just starting off with the pandas_utils function - curious to get some initial feedback, and to know if there'd be interest in this before moving on to the rest.

The idea is that, instead of hard-coding support for pandas, you could make use of the DataFrame Consortium API, which is (work in progress!) defined here: https://data-apis.org/dataframe-api/draft/index.html

Then, any function which adheres to that API will:

  • keep support for pandas (so, no regression in functionality)
  • also gain support for Polars, and any other dataframe library which implements the API (devs from cuDF and Modin also want to implement it)

Example

If you run

import pandas as pd

from sklego.pandas_utils import add_lags

df = pd.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
print(add_lags(df, "X1", -1))

then it will print

   X1   X2  X1-1
0   1  178   0.0
1   2  154   1.0

, just like it does now. No change here.

However, you can now also run

import polars as pl

from sklego.pandas_utils import add_lags

df = pl.DataFrame({"X1": [0, 1, 2], "X2": [float('nan'), "178", "154"]})
result = add_lags(df, "X1", -1)
print(result.collect())

and you'll get

shape: (2, 3)
┌─────┬─────┬──────┐
│ X1  ┆ X2  ┆ X1-1 │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ str ┆ i64  │
╞═════╪═════╪══════╡
│ 1   ┆ 178 ┆ 0    │
│ 2   ┆ 154 ┆ 1    │
└─────┴─────┴──────┘

(note how I had to add collect before printing the result, as the result is a Polars LazyFrame)
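For context, the lag in the examples above boils down to a plain shift followed by dropping rows without a lagged value. A minimal pandas-only sketch of that behaviour (illustrative only - this is not scikit-lego's actual implementation, and the function name is made up):

```python
import pandas as pd

def add_lag_sketch(df: pd.DataFrame, col: str, lag: int) -> pd.DataFrame:
    # A lag of -1 pulls the previous row's value into the current row;
    # rows for which no lagged value exists are then dropped.
    new_col = f"{col}{lag}"
    out = df.copy()
    out[new_col] = out[col].shift(-lag)
    return out.dropna(subset=[new_col])
```

For example, `add_lag_sketch(pd.DataFrame({"X1": [0, 1, 2]}), "X1", -1)` keeps rows 1 and 2, with `X1-1` holding 0.0 and 1.0, matching the output above.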

Dependencies

Note how in sklego/dataframe_utils.py, it's now possible to remove import pandas as pd. If this API were used throughout the package, then you wouldn't even need pandas as a required dependency - scikit-lego would be truly dataframe-agnostic, and would use whichever dataframe package the user passes.

All that would be needed would be the dataframe-api-compat package. Note that it's light as a feather:

  • it's pure Python, with no dependencies (not even pandas / Polars, it'll just defer to whatever the user calls __dataframe_consortium_standard__ on)
  • the wheel is literally only about 20-30 kilobytes https://pypi.org/project/dataframe-api-compat/0.1.28/#files
  • it would be easy to vendor if you didn't want to take on an extra dependency

Checklist:

  • My code follows the style guidelines (flake8)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (also to the readme.md)
  • I have added tests that prove my fix is effective or that my feature works
  • I have added tests to check whether the new feature adheres to the sklearn convention
  • New and existing unit tests pass locally with my changes

If you feel your PR is ready for a review, ping @koaning or @MBrouns.

@MarcoGorelli MarcoGorelli marked this pull request as ready for review November 12, 2023 15:13
@MarcoGorelli MarcoGorelli changed the title WIP RFC Use dataframe api to also support Polars and other dataframe libraries RFC Use dataframe api to also support Polars and other dataframe libraries Nov 12, 2023
@FBruzzesi
Collaborator

Hi @MarcoGorelli, as mentioned privately, this is a very exciting shift for the data ecosystem, and thank you for showcasing how to approach it.

My take on scikit-lego is that the adjustments are fairly tiny and quite easy to implement. The general concern is what the process should be if some functionality is not implemented in the API standard.

@koaning and @MBrouns what do you think?

@koaning
Owner

koaning commented Nov 15, 2023

I'm also in favor of exploring this some more. There are some bumps to discuss with other features (logging a dataframe in Polars is different if the DF is lazy, and I'd also need to double-check our fairness tools), but this initial change seems like a fair place to start!

@MarcoGorelli
Contributor Author

Thanks!

Some parts may require a slight refactor, like TimeGapSplit, because the dataframe api intentionally does not have an index. But it shouldn't be too bad, something like

if isinstance(df, pd.DataFrame):
    # todo: check that '__index__' isn't already a column name
    df = df.reset_index().rename(columns={"index": "__index__"})

and then do a join based on the column __index__, rather than relying on pandas' default auto-aligning on the index
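A runnable version of that idea (the `with_index_column` helper name is made up for illustration), assuming an unnamed index so that `reset_index` produces a column called `index`:

```python
import pandas as pd

def with_index_column(df: pd.DataFrame) -> pd.DataFrame:
    # todo: check that '__index__' isn't already a column name
    assert "__index__" not in df.columns
    # Materialise the (unnamed) index as a regular column, so downstream
    # code can join on it explicitly instead of relying on index alignment.
    return df.reset_index().rename(columns={"index": "__index__"})

a = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])
b = pd.DataFrame({"y": [4, 5, 6]}, index=[10, 20, 30])
merged = with_index_column(a).merge(with_index_column(b), on="__index__")
```

The explicit `merge` on `__index__` reproduces what pandas' auto-alignment would have done implicitly, but in a way that index-less libraries can express too.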

I'll try taking this further then, let's see how far it can go!

The general concern is regarding what should be the process if some functionality is not implemented as api-standard.

I'd suggest a 3-phase approach:

  1. implement parallel logic, like
    if isinstance(df, pd.DataFrame):
        # pandas logic
    elif isinstance(df, (pl.DataFrame, pl.LazyFrame)):
        # Polars logic
    The imports could be done lazily, so you'd still not need either pandas or Polars as a required runtime dependency
  2. open an issue at https://github.com/data-apis/dataframe-api-compat - we can add it there and make a new release quickly
  3. open an issue at https://github.com/data-apis/dataframe-api and try to make it part of the standard - this would matter if you wanted to eventually support other dataframes than "just" pandas and Polars
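As a sketch of phase 1, the parallel logic with lazy imports could look like the following (dispatching on the defining module's name; `maybe_add_lag` is a hypothetical helper, not part of scikit-lego):

```python
def maybe_add_lag(df, col, lag):
    # Dispatch on the top-level module of the object's type, so neither
    # pandas nor polars needs to be installed up front. The pandas branch
    # needs no import at all, since df is already a pandas object; polars
    # is only imported lazily when a polars frame is actually passed.
    mod = type(df).__module__.split(".")[0]
    if mod == "pandas":
        return df.assign(**{f"{col}{lag}": df[col].shift(-lag)})
    if mod == "polars":
        import polars as pl
        return df.with_columns(pl.col(col).shift(-lag).alias(f"{col}{lag}"))
    raise TypeError(f"unsupported dataframe type: {type(df)}")
```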

python-version: 3.8
python-version: 3.9
Contributor Author

@MarcoGorelli MarcoGorelli Nov 15, 2023


the lowest pandas version to have __dataframe_consortium_standard__ is 2.1, which only supports Python 3.9+ - would Python 3.9+ be acceptable?

Collaborator


I never know how reliable this is, but from PyPI stats it seems that the majority of downloads come from 3.7 and 3.8 🫥

Contributor Author


ah interesting - OK, keeping 3.8 then; we can work around that with a helper function, and only use __dataframe_consortium_standard__ on 3.9+
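Such a helper could be as small as this sketch (the name and fallback behaviour are assumptions, not the actual scikit-lego code): on Python 3.9+ it opts into the Consortium API namespace when the object provides one, and otherwise returns the object unchanged.

```python
import sys

def to_standard(df):
    # Hypothetical helper: use the Consortium API namespace on 3.9+
    # when available; on 3.8 (or for objects without the dunder),
    # fall back to the object as-is.
    if sys.version_info >= (3, 9) and hasattr(df, "__dataframe_consortium_standard__"):
        return df.__dataframe_consortium_standard__()
    return df
```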

timeline-wise, we're aiming for February to make the first non-beta release

@MarcoGorelli MarcoGorelli marked this pull request as draft December 2, 2023 20:22
@MarcoGorelli
Contributor Author

this has gone quite out-of-sync; will update soon-ish. Trying to get some updates into the API design itself first

3 participants